Support Mixed precision & Static MSE PTQ in MCore export; Nemotron Super v3 NVFP4 recipe #1363
Conversation
📝 Walkthrough
This PR adds FP8 scale sweep stride control to calibration workflows, introduces three mixed-precision NVFP4 quantization recipes for Nemotron-3-Super-120B with different calibration methods (MSE, MSE with stride-4 sweep, max-based), refactors MoE calibration completeness checks to recursively traverse SequentialQuantizer leaves, and overhauls HuggingFace export to collect and apply per-layer quantization metadata.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (5 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 29-32: The metadata claims attention o_proj is FP8 per-tensor but
quant_cfg lacks any override for attention.o_proj; either update the description
or add the missing quantizer mapping. Fix by adding an explicit quant_cfg
override for the attention o_proj parameter name (e.g., attention.o_proj /
attention.output_projection / whatever key is used in your model mapping) to use
the FP8 per-tensor quantizer used elsewhere, or remove the attention o_proj
mention from the metadata so it matches the existing quant_cfg; ensure you
reference the exact layer key used in quant_cfg to keep mapping consistent with
the model.
- Around line 94-115: The entries using broad quantizer_name patterns
('*mixer.fc1_latent_proj*weight_quantizer',
'*mixer.fc1_latent_proj*input_quantizer',
'*mixer.fc2_latent_proj*weight_quantizer',
'*mixer.fc2_latent_proj*input_quantizer') are enabling FP8 for every latent
projection instead of only layers 1, 3, and 5; update these quantizer_name
values to target only the specific layer instances (e.g. include the layer
index/identifier for layers 1, 3, and 5 in the wildcard or use a regex/explicit
list) so only those mixer.fc1_latent_proj and mixer.fc2_latent_proj quantizers
are set to num_bits: e4m3, and leave all other latent projection quantizers at
BF16 (or remove the generic entries).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 60ed9c0b-efef-4967-b321-7270a8853455
📒 Files selected for processing (1)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
Codecov Report: ❌ patch coverage check failed. Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1363       +/-   ##
==========================================
- Coverage   76.93%   70.44%    -6.50%
==========================================
  Files         471      477        +6
  Lines       50404    55095     +4691
==========================================
+ Hits        38776    38809       +33
- Misses      11628    16286     +4658
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/model_calib.py (1)
389-412: ⚠️ Potential issue | 🟡 Minor
Custom FP8-sweep backends still ignore the new stride setting.
When a registered `backend_factory` is used, this branch still calls the old 3-argument factory signature, so `fp8_scale_sweep_stride` only takes effect on the built-in `NVFP4MSECalibrator` path below. That makes the new config a silent no-op for registry-backed sweep calibrators. Please extend the factory contract or reject a non-default stride explicitly in the backend path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 389 - 412, The registered backend factory path (lookup via _FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls the factory with the old 3-argument signature and thus ignores fp8_scale_sweep_stride; update the branch that sets module._calibrator via backend_factory to either (A) call the factory with the new argument (pass fp8_scale_sweep_stride) and update the factory contract accordingly, or (B) explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a clear error so users know registry-backed calibrators do not support stride; ensure the call still passes initial_amax, module._calibrator._axis, partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
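The factory-contract extension this comment asks for could look like the following sketch. The names `backend_factory`, `initial_amax`, and `fp8_scale_sweep_stride` come from the review text; the helper itself is illustrative, not the actual `model_calib.py` code. It passes the stride only when the registered factory accepts it and fails loudly otherwise, instead of silently ignoring the setting:

```python
import inspect


def build_sweep_calibrator(backend_factory, initial_amax, axis, quant_func,
                           fp8_scale_sweep_stride=1):
    """Call a registered sweep-calibrator factory, forwarding the stride only
    when the factory's signature accepts it; reject a non-default stride
    otherwise so it never becomes a silent no-op. (Illustrative sketch.)"""
    params = inspect.signature(backend_factory).parameters
    if "fp8_scale_sweep_stride" in params:
        # Option A: extended factory contract.
        return backend_factory(initial_amax, axis, quant_func,
                               fp8_scale_sweep_stride=fp8_scale_sweep_stride)
    if fp8_scale_sweep_stride != 1:
        # Option B: legacy 3-argument factory cannot honor the stride.
        raise ValueError(
            "registered FP8-sweep calibrator does not support "
            f"fp8_scale_sweep_stride={fp8_scale_sweep_stride}"
        )
    return backend_factory(initial_amax, axis, quant_func)
```

Signature inspection keeps old registry entries working unchanged while new factories can opt in to the stride argument.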
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
41-42: ⚠️ Potential issue | 🟡 Minor
The metadata description still overstates what this recipe quantizes.
These lines say `attention o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but there are no matching overrides in `quant_cfg`, and the header comments say the latent MoE projections stay BF16. Please update the description so it matches the actual recipe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 41-42: The metadata description in the YAML (the description field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and "fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching overrides in quant_cfg and header comments state latent MoE projections remain BF16/FP16; update the description text to accurately reflect the recipe by removing or changing the FP8 claims (e.g., state that latent MoE projections and those specific projections remain BF16/FP16 and only list the layers that the quant_cfg actually overrides as FP8 per-tensor).

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor
The metadata description does not match the actual quantizer mapping.
These lines still claim `attention o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but this recipe never enables those quantizers, and the header comments above say latent MoE stays BF16. Please align the description with the `quant_cfg` that follows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The YAML description string claims that "attention o_proj, fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in this recipe does not enable quantizers for "attention.o_proj", "fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays BF16/FP16); fix this by either updating the description to state those projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify quant_cfg to actually enable FP8 per-tensor quantizers for the keys "attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment matches the mapping.
🧹 Nitpick comments (1)
modelopt/torch/quantization/calib/mse.py (1)
202-206: Add a regression test for strided FP8 candidate generation.
The existing coverage only exercises the default 126-candidate path. This branch adds two new behaviors (subsampling and forced inclusion of the last candidate), so it should have a focused test for `fp8_scale_sweep_stride > 1` to lock down both the reduced candidate count and preservation of the max scale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/calib/mse.py` around lines 202 - 206, The strided FP8 candidate-generation branch guarded by fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into candidates and appends the last value) is untested; add a focused regression test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and asserts that the resulting candidates length is reduced according to the stride and that the final element equals the original fp8_values[-1] (verifying forced inclusion of the max scale); ensure the test covers both subsampling and the append behavior so the branch is locked down.
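As a rough illustration of the two behaviors the comment wants pinned down, here is a standalone sketch of the subsampling logic plus the kind of assertions such a regression test could make. This mirrors, but is not, the actual `calib/mse.py` implementation; the function name and the 126-candidate count come from the review text:

```python
def strided_fp8_candidates(fp8_values, stride):
    """Subsample the FP8 scale-sweep candidates by `stride`, always keeping
    the final (max-scale) candidate. Illustrative stand-in for the strided
    branch in calib/mse.py, not the real implementation."""
    if stride <= 1:
        return list(fp8_values)
    candidates = list(fp8_values[::stride])
    if candidates[-1] != fp8_values[-1]:
        # Force-include the max scale even when the stride skips it.
        candidates.append(fp8_values[-1])
    return candidates


# The regression test would assert both behaviors at once:
full = list(range(126))          # stand-in for the default 126-candidate sweep
strided = strided_fp8_candidates(full, 4)
assert len(strided) < len(full)  # subsampling reduced the candidate count
assert strided[-1] == full[-1]   # max-scale candidate is preserved
```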
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 16-28: Remove the unresolved Git merge conflict markers (<<<<<<<,
=======, >>>>>>>) and restore a single coherent comment block describing the
quant config; keep the more detailed version that lists both HF and
Megatron-Core names (the lines mentioning mixer.experts.<N>.{up,down}_proj,
mlp.experts.local_experts.<N>.linear_fc{1,2},
mixer.shared_experts.{up,down}_proj, and mlp.shared_experts.linear_fc{1,2}) or
merge its additional details into the shorter variant so the YAML comment is
valid and free of conflict markers.
---
Outside diff comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 389-412: The registered backend factory path (lookup via
_FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls
the factory with the old 3-argument signature and thus ignores
fp8_scale_sweep_stride; update the branch that sets module._calibrator via
backend_factory to either (A) call the factory with the new argument (pass
fp8_scale_sweep_stride) and update the factory contract accordingly, or (B)
explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a
clear error so users know registry-backed calibrators do not support stride;
ensure the call still passes initial_amax, module._calibrator._axis,
partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride
when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
---
Duplicate comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The YAML description string claims that "attention o_proj,
fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in
this recipe does not enable quantizers for "attention.o_proj",
"fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays
BF16/FP16); fix this by either updating the description to state those
projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify
quant_cfg to actually enable FP8 per-tensor quantizers for the keys
"attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment
matches the mapping.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 41-42: The metadata description in the YAML (the description
field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and
"fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching
overrides in quant_cfg and header comments state latent MoE projections remain
BF16/FP16; update the description text to accurately reflect the recipe by
removing or changing the FP8 claims (e.g., state that latent MoE projections and
those specific projections remain BF16/FP16 and only list the layers that the
quant_cfg actually overrides as FP8 per-tensor).
---
Nitpick comments:
In `@modelopt/torch/quantization/calib/mse.py`:
- Around line 202-206: The strided FP8 candidate-generation branch guarded by
fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into
candidates and appends the last value) is untested; add a focused regression
test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets
fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and
asserts that the resulting candidates length is reduced according to the stride
and that the final element equals the original fp8_values[-1] (verifying forced
inclusion of the max scale); ensure the test covers both subsampling and the
append behavior so the branch is locked down.
📒 Files selected for processing (5)
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
35-36: ⚠️ Potential issue | 🟡 Minor
Metadata precision mapping is inconsistent with the actual recipe.
Line 35 says latent MoE `fc1_latent_proj`/`fc2_latent_proj` are FP8 per-tensor, but this file has no latent-MoE quantizer overrides and the header (Line 27) says they stay BF16. Please align the description with `quant_cfg` to avoid confusion.
Proposed patch

```diff
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj - FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with FP8 scale sweep.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 35-36: The description's claim that latent MoE "fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the quant_cfg (and header) which leave those layers as BF16; either update the description string to state that fc1_latent_proj and fc2_latent_proj remain BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8 per-tensor (matching the rest of the FP8 settings); ensure the description text and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are aligned.

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor
Description/comments conflict on latent MoE and SSM/mamba precision.
Lines 34-35 state latent MoE is FP8 per-tensor, but the recipe does not enable latent-MoE quantizers. Also, Lines 134-135 conflict with earlier comments on SSM/mamba precision. Please make these statements internally consistent.
Proposed patch

```diff
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj - FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with stride-4 FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with stride-4 FP8 scale sweep.
@@
-  # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-  # SSM state / mamba conv1d stay FP16.
+  # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head, mamba conv1d.
+  # SSM state stays FP32 (can be set to FP16 in vLLM).
```

Also applies to: 134-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The description text claims "latent MoE is FP8 per-tensor" and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by making the YAML fields match the prose: either add the latent-MoE quantizer key (e.g., include latent_moe in the quantizers list or set enable_latent_moe_quantizers: true) and ensure its precision is set to FP8 per-tensor, or change the description to remove the FP8 latent-MoE claim; likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba precision fields to uniformly state BF16/FP16 (or change the description to reflect the actual configured precisions) so the "description" string and the quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision, mamba_precision, lm_head_precision / mtp_precision) are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description text claims "latent MoE is FP8 per-tensor"
and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable
latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by
making the YAML fields match the prose: either add the latent-MoE quantizer key
(e.g., include latent_moe in the quantizers list or set
enable_latent_moe_quantizers: true) and ensure its precision is set to FP8
per-tensor, or change the description to remove the FP8 latent-MoE claim;
likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba
precision fields to uniformly state BF16/FP16 (or change the description to
reflect the actual configured precisions) so the "description" string and the
quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision,
mamba_precision, lm_head_precision / mtp_precision) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description's claim that latent MoE
"fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the
quant_cfg (and header) which leave those layers as BF16; either update the
description string to state that fc1_latent_proj and fc2_latent_proj remain
BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg
for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8
per-tensor (matching the rest of the FP8 settings); ensure the description text
and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are
aligned.
📒 Files selected for processing (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
# Megatron-Core names: mlp.shared_experts.linear_fc{1,2}
# - Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
# - KV cache: FP8
# - Attention linears ({q,k,v}_proj): BF16 (not quantized)
Can we double check attention out linear? IIRC, attention o_proj should be FP8.
responded in slack, only 2/9 attention layers had o_proj FP8 in final Super NVFP4 ckpt, but we can always add it later to test if accuracy degradation is minimal
/claude review

@coderabbitai full review
✅ Actions performed: full review triggered.

@claude review
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)
1213-1229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Record packed-expert quant metadata against the module prefix, not the weight key.
In both pack-remap helpers, `prefix` is the tensor key written into `state_dict` (for example `...weight`). Recording that verbatim produces `layer_config_dict` entries like `...weight.quantization`, and `process_layer_quant_config()` will then emit `quantized_layers` names ending in `.weight` instead of the HF module prefix. Serving-side layer matching will miss those packed experts.
Suggested fix

```diff
- self._record_layer_quant_config(prefix, qformat, block_size)
+ module_prefix = prefix.rsplit(".", 1)[0] + "."
+ self._record_layer_quant_config(module_prefix, qformat, block_size)
```

Apply the same normalization in both `_pack_name_remapping()` and `_pack_name_remapping_gpt_oss()`.
Also applies to: 1280-1298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_megatron.py` around lines 1213 - 1229, The code records quant metadata under the full tensor key (e.g., "...weight") causing layer names to end with ".weight"; update both _pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix before calling self._record_layer_quant_config by stripping the tensor suffix (e.g., remove trailing ".weight" or the last dot component) so the module-level HF prefix is recorded instead of the weight key; apply the same normalization logic where self._record_layer_quant_config(prefix, qformat, block_size) is invoked in both functions.
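The normalization in the suggested fix is just a tensor-key-to-module-prefix strip. A tiny self-contained sketch, where the trailing-dot convention is an assumption carried over from the suggested diff rather than something verified against the exporter:

```python
def module_prefix(tensor_key: str) -> str:
    """Drop the trailing tensor name (e.g. 'weight') from a state_dict key,
    keeping a trailing dot, so quant metadata is keyed by the module prefix
    rather than the weight key. Illustrative sketch of the suggested fix."""
    return tensor_key.rsplit(".", 1)[0] + "."


# A packed-expert weight key maps back to its module prefix:
print(module_prefix("model.layers.0.mlp.experts.weight"))
# model.layers.0.mlp.experts.
```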
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml`:
- Around line 16-36: The recipe metadata/comments misstate Latent MOE behavior:
fc1_latent_proj/fc2_latent_proj are described as FP8 per-tensor but the
quant_cfg (and lines noting "stay BF16/FP16") never enable those quantizers;
either update the human-readable description to say Latent MOE projections
remain BF16 (or FP16 per existing comment) to match quant_cfg, or enable FP8
per-tensor quantizers for fc1_latent_proj and fc2_latent_proj in the quant_cfg
so the comment matches behavior—make the change by editing the
metadata/description block and/or the quant_cfg entries that reference
fc1_latent_proj and fc2_latent_proj accordingly.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description erroneously claims the Latent MoE is "FP8
per-tensor" while the recipe does not enable quantizer patterns for
fc1_latent_proj or fc2_latent_proj; update the metadata description string (the
YAML "description" field) to remove or correct the FP8 statement for Latent MoE
(or explicitly state that fc1_latent_proj/fc2_latent_proj remain BF16/FP16) so
it matches the actual quantizer configuration for fc1_latent_proj and
fc2_latent_proj.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 35-36: The description claims latent MoE is FP8 and that static
scales are "chosen by MSE", but the YAML sets no quantizers for
fc1_latent_proj/fc2_latent_proj and uses method: max; update the human-readable
description to match the actual config by either (A) enabling the latent
projection quantizers (fc1_latent_proj/fc2_latent_proj) to make latent MoE FP8,
or (B) explicitly state latent projections remain unquantized (not FP8) and
change the phrase "static scales are chosen by MSE" to reflect the configured
method (e.g., "static scales computed by max" or "method: max"); adjust the
description lines referencing "FP8 per-tensor; ... Latent MOE
fc1_latent_proj/fc2_latent_proj" and the sentence about static scale selection
accordingly so the text and the fields (fc1_latent_proj, fc2_latent_proj, and
method: max) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description string misreports the latent-MoE policy:
update the description (the YAML description field) to reflect that
fc1_latent_proj and fc2_latent_proj are left unquantized (per the quant_cfg)
rather than being "FP8 per-tensor"; edit the text around the FP8/KV/latent-MoE
sentence to explicitly state that fc1_latent_proj/fc2_latent_proj remain
unquantized (or their actual precision) and keep FP8 per-tensor/KV wording only
for the tensors that truly use FP8.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 815-818: The current check only treats qformat is None as
excluded, but QUANTIZATION_NONE should be treated the same so those layers are
recorded as excluded (and not dropped from the exported quant config); update
the conditional in unified_export_megatron.py where qformat is obtained from
_get_quantization_format(module) to also consider QUANTIZATION_NONE (e.g., if
qformat is None or qformat == QUANTIZATION_NONE) before calling
_record_excluded_module(prefix), and ensure any references to
_record_layer_quant_config still see these explicitly disabled quantizers as
excluded rather than omitted.
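The intent of this fix can be sketched as a small routing helper. The `QUANTIZATION_NONE` value below is a stand-in; the real sentinel (and the `_record_*` methods it feeds) live in modelopt's export code:

```python
QUANTIZATION_NONE = "none"  # stand-in for the exporter's real sentinel


def record_layer(prefix, qformat, quantized_layers, excluded_modules):
    """Treat both `None` and QUANTIZATION_NONE as 'not quantized': record the
    module as explicitly excluded instead of silently dropping it from the
    exported quant config. Illustrative sketch of the suggested conditional."""
    if qformat is None or qformat == QUANTIZATION_NONE:
        excluded_modules.append(prefix)
    else:
        quantized_layers[prefix] = qformat
```

Recording the exclusion (rather than omitting the layer) keeps serving-side consumers of the quant config aware that the layer was intentionally left unquantized.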
---
Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 1213-1229: The code records quant metadata under the full tensor
key (e.g., "...weight") causing layer names to end with ".weight"; update both
_pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix
before calling self._record_layer_quant_config by stripping the tensor suffix
(e.g., remove trailing ".weight" or the last dot component) so the module-level
HF prefix is recorded instead of the weight key; apply the same normalization
logic where self._record_layer_quant_config(prefix, qformat, block_size) is
invoked in both functions.
📒 Files selected for processing (11)
modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
tests/gpu_megatron/torch/export/test_unified_export_megatron.py
@CodeRabbit full review
✅ Actions performed: full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)
120-122: 💤 Low value
`_get_static_global_amax` is called twice for the static branch.
`_is_static_quantizer` internally calls `_get_static_global_amax` and discards the result; line 122 then calls it again. Consider hoisting to a single variable:
♻️ Proposed refactor

```diff
- if cls._is_static_quantizer(weight_quantizer):
-     # Static path: use pre-computed per-block amax values from quantizer
-     global_amax = cls._get_static_global_amax(weight_quantizer).float()
+ global_amax = cls._get_static_global_amax(weight_quantizer)
+ if global_amax is not None:
+     # Static path: use pre-computed per-block amax values from quantizer
+     global_amax = global_amax.float()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 120-122: Hoist the call to _get_static_global_amax so it's executed only once: call amax = cls._get_static_global_amax(weight_quantizer) (and .float() as needed) and reuse that value instead of calling _get_static_global_amax again; update _is_static_quantizer to accept an optional precomputed amax parameter or refactor _is_static_quantizer to avoid calling _get_static_global_amax internally so the static branch uses the hoisted amax (refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).

tests/gpu_megatron/torch/export/test_unified_export_megatron.py (1)
151-164: ⚡ Quick win
Add one end-to-end mixed-precision export case.
The new export path here is the per-layer `layer_config_dict`/`process_layer_quant_config` flow, but this matrix still only exercises uniform configs (`NVFP4_DEFAULT_CFG`/`FP8_DEFAULT_CFG`). A single mixed-recipe case would catch regressions in `quantized_layers`, excludes, and the `config.json` ↔ `hf_quant_config.json` parity you just tightened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines 151 - 164, Add a new parametrized test case to the existing pytest.mark.parametrize matrix (the tuple of ("model_type", "arch", "extra_module", "quant_config", "kv_cache_quant_cfg")) to exercise the per-layer mixed-precision export path: supply a quant_config that triggers the layer_config_dict / process_layer_quant_config flow (i.e., a mixed-recipe config rather than NVFP4_DEFAULT_CFG or FP8_DEFAULT_CFG) so the test exercises quantized_layers, excludes handling, and the config.json ↔ hf_quant_config.json parity; update the matrix alongside existing entries (referencing the parameters model_type/arch/extra_module/quant_config/kv_cache_quant_cfg) so one case uses a mixed per-layer config for either nemotron or llama.
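One way to phrase such a matrix row, sketched with entirely hypothetical config contents and placeholder arch/module values (the real parametrize entries and recipe keys live in the test file and modelopt's quantization configs):

```python
# Hypothetical mixed per-layer recipe: NVFP4 block quantization for most
# linear weights, an FP8 override for shared experts, lm_head disabled.
# Keys and values below are illustrative, not real modelopt configs.
MIXED_NVFP4_FP8_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        "*shared_experts*weight_quantizer": {"num_bits": (4, 3)},
        "*lm_head*": {"enable": False},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

# Extra row for the ("model_type", "arch", "extra_module", "quant_config",
# "kv_cache_quant_cfg") parametrize matrix; placeholders throughout.
mixed_case = ("llama", "LlamaForCausalLM", None, MIXED_NVFP4_FP8_CFG, None)
```

A case like this drives the per-layer export path once, so `quantized_layers`, excluded modules, and the two config files get exercised together rather than only under uniform recipes.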
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 45-79: The two routed-expert weight quantizer entries
('*mixer.experts.*weight_quantizer' and '*mlp.experts*weight_quantizer')
currently use type: dynamic; change them to use the static-weight calibration
path by replacing type: dynamic with type: static (or remove the dynamic setting
and ensure static is explicitly set) so routed-expert weights remain static in
this max-calib variant while leaving the corresponding input_quantizer entries
unchanged.
---
Nitpick comments:
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 120-122: Hoist the call to _get_static_global_amax so it's
executed only once: call amax = cls._get_static_global_amax(weight_quantizer)
(and .float() as needed) and reuse that value instead of calling
_get_static_global_amax again; update _is_static_quantizer to accept an optional
precomputed amax parameter or refactor _is_static_quantizer to avoid calling
_get_static_global_amax internally so the static branch uses the hoisted amax
(refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94f3528f-2ee4-4757-b8de-3a323300ca74
📒 Files selected for processing (10)
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/calib/mse.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
- tests/gpu_megatron/torch/export/test_unified_export_megatron.py
```yaml
# MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
# HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
- quantizer_name: '*mixer.experts.*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mixer.experts.*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
# Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
- quantizer_name: '*mlp.experts*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mlp.experts*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
    type: dynamic
    scale_bits: e4m3
    num_bits: e2m1
```
Keep routed-expert weights static in the max-calib variant.
These two `*...weight_quantizer` entries are `type: dynamic`, so this recipe does more than swap `method: mse` for `method: max`: it changes routed-expert weights to dynamic NVFP4 and bypasses the static-weight calibration path entirely. That makes this variant a different quantization recipe, not a max-calibrated version of `super-nvfp4.yaml`.
Suggested fix

```diff
 - quantizer_name: '*mixer.experts.*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
       -1: 16
-    type: dynamic
+    type: static
     scale_bits: e4m3
     num_bits: e2m1
@@
 - quantizer_name: '*mlp.experts*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
       -1: 16
-    type: dynamic
+    type: static
     scale_bits: e4m3
     num_bits: e2m1
```
```python
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
```
do we change the format of hf_quant_config.json by this change?
Yes, to align it with the quant config inside `config.json`. Previously `hf_quant_config.json` was still using the older JSON format, whereas the quant config inside `config.json` was being converted to the newer format.
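For illustration, the alignment amounts to flattening one raw layout into the serving-facing one; the function name and key mapping below are assumptions for the sketch, not the actual ModelOpt helper:

```python
# Hypothetical sketch: flatten the raw ModelOpt layout
# ({"producer": ..., "quantization": ...}) into the serving-facing layout
# already used inside config.json. Key names are illustrative.
def convert_quant_config_sketch(raw_config: dict) -> dict:
    quant = raw_config.get("quantization", {})
    converted = {
        "quant_algo": quant.get("quant_algo"),
        "kv_cache_quant_algo": quant.get("kv_cache_quant_algo"),
        "ignore": quant.get("exclude_modules", []),
    }
    # Keep the output minimal: drop keys that are unset in the raw config.
    return {k: v for k, v in converted.items() if v not in (None, [])}


raw = {
    "producer": {"name": "modelopt", "version": "0.0"},
    "quantization": {"quant_algo": "NVFP4", "exclude_modules": ["lm_head"]},
}
converted = convert_quant_config_sketch(raw)
```

Writing this converted dict to both files is what gives the two configs a single schema.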
```python
return global_amax is not None


@classmethod
def _get_static_global_amax(cls, weight_quantizer):
```
Thanks for catching the bug! I'd like to understand why the MCore path uses `_global_amax` instead of `global_amax`. Is there any general guidance for when to use the private attribute name and when to use the public one? I think we have a similar issue with `amax` vs `_amax`.
During checkpoint resume, MCore initializes the quantizers as generic `TensorQuantizer`, which carries a `_global_amax` parameter (unlike `NVFP4QTensor`, which has `global_amax`).
@Fridah-nv I have changed the MCore static quantizer restore to restore as `NVFP4QTensor` in `mtq.plugins.custom`.
Fridah-nv
left a comment
Changes in MSE flow LGTM
/claude review |
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
force-pushed the branch from 1a10eeb to 8fc569d
```python
axis=module._calibrator._axis,
global_amax=module.global_amax,
quant_func=partial(_mse_quant_func, quantizer=module),
fp8_scale_sweep_stride=fp8_scale_sweep_stride,
```
[CRITICAL Algorithm] NVFP4MSECalibrator.__init__ does not accept a fp8_scale_sweep_stride parameter (see calib/mse.py:177-184). Passing it here will raise TypeError: __init__() got an unexpected keyword argument 'fp8_scale_sweep_stride' at runtime whenever fp8_scale_sweep=True and the quantizer is an NVFP4 static quantizer.
Either:
- Add `fp8_scale_sweep_stride` to `NVFP4MSECalibrator.__init__` and use it in `_generate_candidates` to subsample the candidate list (e.g., `fp8_values[::stride]`), or
- Remove this kwarg from the call site until the calibrator supports it.
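A minimal sketch of the first option, assuming a candidate generator shaped like the one the review references; the class and method names (`_generate_candidates`, `fp8_scale_sweep_stride`) mirror the review text and are not verified against the actual calibrator:

```python
# Sketch: accept a stride in the constructor and subsample the FP8 scale
# sweep, so stride=4 evaluates a quarter of the candidates.
class Nvfp4MseCalibratorSketch:
    def __init__(self, num_candidates: int = 16, fp8_scale_sweep_stride: int = 1):
        self.num_candidates = num_candidates
        self.stride = fp8_scale_sweep_stride

    def _generate_candidates(self) -> list[float]:
        # Hypothetical full sweep over power-of-two scales, then strided.
        fp8_values = [
            2.0**e for e in range(-self.num_candidates // 2, self.num_candidates // 2)
        ]
        return fp8_values[:: self.stride]


full = Nvfp4MseCalibratorSketch()._generate_candidates()
strided = Nvfp4MseCalibratorSketch(fp8_scale_sweep_stride=4)._generate_candidates()
```

With the stride accepted in `__init__`, the call site shown in the diff above would no longer raise.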
```python
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
```
[IMPORTANT Compatibility] This changes the schema of hf_quant_config.json from the raw ModelOpt format ({"producer": {...}, "quantization": {...}}) to the compressed-tensors-compatible format ({"config_groups": {...}, "ignore": [...], "quant_algo": ...}).
The HF export path (unified_export_hf.py:1190-1194) still writes the raw format to hf_quant_config.json and the converted format to config.json. This PR makes the MCore export path diverge by writing converted format in both places.
Any downstream tools (TRT-LLM build scripts, vLLM loaders) that read hf_quant_config.json from MCore-exported checkpoints and expect the "quantization" top-level key will break. If this is intentional (which the test updates suggest), please mention the schema change in the PR description and/or changelog so consumers know to update.
```python
if _has_state(self.weight_quantizer, "_amax") and not _has_complete_static_nvfp4_weight_state(
    self.weight_quantizer, self.weight
):
```
[IMPORTANT Compatibility] The condition _has_state(wq, "_amax") and not _has_complete_static_nvfp4_weight_state(wq, self.weight) means: if _amax exists but the quantizer is NOT a complete static NVFP4, recalibrate. Previously it always recalibrated when _amax existed.
However, _has_complete_static_nvfp4_weight_state checks global_amax = getattr(quantizer, "global_amax", None) which relies on the global_amax property of NVFP4StaticQuantizer (which in turn reads _global_amax buffer). If the quantizer class hasn't been upgraded to NVFP4StaticQuantizer yet at this point during restore (because modelopt_post_restore is called before the megatron plugin's set_extra_state → from_tensor_quantizer), then global_amax will be None (plain TensorQuantizer doesn't have this property).
In that case, the function returns False, falling through to recalibrate — which is the safe/correct fallback. So this works correctly, but the "preserve MSE scales" path only activates when the quantizer has already been upgraded. Is that the intended sequence? If the HF restore path always upgrades first (via the _check_unsupported_states / from_tensor_quantizer flow), this is fine. Just flagging for awareness.
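The safe fallback described here can be sketched as follows; both classes and the helper are illustrative stand-ins, assuming only that a not-yet-upgraded quantizer lacks the `global_amax` attribute:

```python
# Sketch: a completeness check that returns False (forcing recalibration)
# whenever the quantizer has not been upgraded yet, so the worst case is
# redundant work rather than stale or missing scales.
def has_complete_static_nvfp4_state(quantizer) -> bool:
    return getattr(quantizer, "global_amax", None) is not None


class PlainTensorQuantizer:
    """Stand-in for a quantizer restored before the upgrade step."""


class StaticNvfp4Quantizer:
    """Stand-in for an upgraded quantizer carrying the calibrated amax."""

    global_amax = 448.0


needs_recalib_plain = not has_complete_static_nvfp4_state(PlainTensorQuantizer())
needs_recalib_static = not has_complete_static_nvfp4_state(StaticNvfp4Quantizer())
```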
```python
for leaf_quantizer in _iter_leaf_quantizers(quantizer):
    if _is_dynamic_block_quantizer(leaf_quantizer):
        continue
    leaf_quantizer.sync_amax_across_distributed_group(parallel_state.data_parallel_group)
    leaf_quantizer.sync_amax_across_distributed_group(
        parallel_state.expert_model_parallel_group
    )
```
[SUGGESTION] The old code guarded sync_amax_across_distributed_group with if getattr(quantizer, "_amax", None) is not None. The new code removes this guard and relies on the internal check inside sync_amax_across_distributed_group.
While functionally correct (the method has if ... getattr(self, "_amax", None) is not None internally), this introduces unnecessary all_reduce call-site overhead for dynamic block quantizers that were already skipped by _is_dynamic_block_quantizer. However, non-dynamic quantizers that happen to not have _amax (e.g., disabled quantizers yielded by _iter_leaf_quantizers) will now call into sync_amax_across_distributed_group only to no-op internally. Minor perf consideration — no functional bug.
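The internal-guard behavior being relied on can be sketched with a stand-in quantizer; the class is illustrative, assuming only that the real sync method checks `_amax` before reducing across the distributed group:

```python
# Sketch: the per-call guard lives inside the sync method, so quantizers
# without an _amax simply no-op when called.
class LeafQuantizerSketch:
    def __init__(self, amax=None):
        self._amax = amax
        self.synced = False

    def sync_amax_across_distributed_group(self, group=None):
        if getattr(self, "_amax", None) is None:
            return  # internal guard: nothing to reduce
        self.synced = True  # stand-in for the actual all_reduce


quantizers = [LeafQuantizerSketch(1.0), LeafQuantizerSketch(None)]
for q in quantizers:
    q.sync_amax_across_distributed_group()
```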
```python
self._record_layer_quant_config(q_proj_prefix, qformat, block_size)
self._record_layer_quant_config(k_proj_prefix, qformat, block_size)
self._record_layer_quant_config(v_proj_prefix, qformat, block_size)
if qformat in (None, QUANTIZATION_NONE):
    # MCore stores Q/K/V as one fused linear_qkv module, but HF exports them
    # as separate q_proj/k_proj/v_proj modules. Record the HF names so
    # runtime quant configs do not miss excluded fused-QKV projections.
    fused_prefix = prefix.removesuffix(".")
    self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix]
    self._record_excluded_module(q_proj_prefix)
    self._record_excluded_module(k_proj_prefix)
    self._record_excluded_module(v_proj_prefix)
```
[SUGGESTION] When qformat in (None, QUANTIZATION_NONE), _record_excluded_module is called with q_proj_prefix, k_proj_prefix, v_proj_prefix which all end with ".". The method correctly strips the trailing dot via removesuffix(".").
However, the list-comprehension self.exclude_modules = [m for m in self.exclude_modules if m != fused_prefix] removes the original fused module entry. If _qkv_slicing is called multiple times for the same prefix (e.g., from different pipeline stages in distributed), the second call would find fused_prefix already absent — which is fine. But the deduplication inside _record_excluded_module (the if layer_name not in self.exclude_modules check) prevents the individual proj names from being double-added too. Good.
One minor observation: if prefix is "" (empty), then fused_prefix = "".removesuffix(".") is also "", and the list-comp would try to remove "" from exclude_modules. You've guarded _record_excluded_module against empty strings, but the fused_prefix removal isn't similarly guarded. Consider adding if fused_prefix: before the list-comp to be consistent.
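The suggested guard can be sketched as a small helper; the function name and list handling are assumptions for illustration, not the real export code:

```python
# Sketch of the QKV exclude expansion with the suggested empty-prefix guard.
def expand_fused_qkv_exclude(exclude_modules: list[str], prefix: str) -> list[str]:
    exclude_modules = list(exclude_modules)  # do not mutate the caller's list
    fused_prefix = prefix.removesuffix(".")
    if fused_prefix:  # suggested guard: skip removal when the prefix is empty
        exclude_modules = [m for m in exclude_modules if m != fused_prefix]
    for proj in ("q_proj", "k_proj", "v_proj"):
        name = f"{prefix}{proj}"
        if name not in exclude_modules:  # dedupe, as in _record_excluded_module
            exclude_modules.append(name)
    return exclude_modules


out = expand_fused_qkv_exclude(["layers.0.attn"], "layers.0.attn.")
```

The fused entry is swapped for the three per-projection HF names, and an empty prefix no longer tries to remove `""` from the list.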
```python
start_multiplier: float = 0.25,
stop_multiplier: float = 4.0,
fp8_scale_sweep: bool = False,
fp8_scale_sweep_stride: int = 1,
```
[IMPORTANT Compatibility] fp8_scale_sweep_stride is added to mse_calibrate() but not to MseCalibConfig (see config.py:1287). Since wrapped_calib_func calls func(model, forward_loop=forward_loop, **config.model_dump()), this parameter can never flow through the standard ModelOpt mode/config pipeline (e.g., when using apply_mode with a recipe).
For recipes like super-nvfp4.yaml to actually use fp8_scale_sweep_stride, it needs to be added to MseCalibConfig (or a new config field) and parsed from the recipe YAML algorithm section. Without this, the parameter is only usable via direct mse_calibrate() calls, which limits the recipe-driven workflow.
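The plumbing gap can be sketched with stand-ins: parameters only reach the calibrate function if they exist as fields on the config model, because the wrapper forwards `config.model_dump()` as kwargs. The dataclass below mimics that pydantic-style behavior; `MseCalibConfigSketch` and `mse_calibrate_sketch` are illustrative, not the real ModelOpt classes:

```python
from dataclasses import asdict, dataclass


@dataclass
class MseCalibConfigSketch:
    num_steps: int = 16
    fp8_scale_sweep: bool = False
    fp8_scale_sweep_stride: int = 1  # field must exist for recipes to set it


def mse_calibrate_sketch(num_steps=16, fp8_scale_sweep=False, fp8_scale_sweep_stride=1):
    # Stand-in for the real calibrate function; just echoes what it received.
    return {"sweep": fp8_scale_sweep, "stride": fp8_scale_sweep_stride, "steps": num_steps}


cfg = MseCalibConfigSketch(fp8_scale_sweep=True, fp8_scale_sweep_stride=4)
result = mse_calibrate_sketch(**asdict(cfg))  # mirrors func(model, **config.model_dump())
```

Without the `fp8_scale_sweep_stride` field on the config, the kwarg simply never appears in the dump and the function falls back to its default.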
Code Review Summary
Findings: CRITICAL: 1, IMPORTANT: 3, SUGGESTION: 2
Critical
`fp8_scale_sweep_stride` passed to `NVFP4MSECalibrator`, which doesn't accept it (model_calib.py:412). This will raise a `TypeError` at runtime whenever `fp8_scale_sweep=True` and the quantizer matches the NVFP4 static path. The constructor at calib/mse.py:177-184 has no such parameter.
Important
- `hf_quant_config.json` schema change (unified_export_megatron.py:354). MCore exports now write the converted (compressed-tensors) format to `hf_quant_config.json`, breaking downstream tools that expect the raw `{"producer": {...}, "quantization": {...}}` format. The HF export path still writes the old format to this file.
- `fp8_scale_sweep_stride` not in `MseCalibConfig` (model_calib.py:335). The parameter can only be used via direct `mse_calibrate()` calls, not through the standard recipe/mode pipeline, limiting its utility for the new YAML recipes.
- Static NVFP4 preserve-amax path timing (plugins/custom.py:164). The "preserve MSE scales" logic depends on the quantizer already being upgraded to `NVFP4StaticQuantizer`. If the upgrade hasn't happened yet (depends on restore ordering), it safely falls through to recalibration, but this means MSE scales may not be preserved in all restore paths.
Suggestions
- Guard `fused_prefix` removal against empty string (unified_export_megatron.py:1074).
- Minor perf: removed `_amax` guard (model_calib.py:196-201). Functionally correct, but it calls into the sync method for quantizers without `_amax`.
Overall Assessment
Risk: Medium-High. The critical TypeError bug (#1) will crash any user path that invokes mse_calibrate with fp8_scale_sweep=True on a model with NVFP4 static quantizers — which is the exact use case of the new super-nvfp4.yaml recipe. The hf_quant_config.json schema change (#2) is a compatibility concern for consumers of MCore-exported checkpoints.
The core logic changes (dynamic block quantizer detection fix, _global_amax fallback for restored static quantizers, per-layer quant config collection for mixed-precision export, QKV exclude expansion) are well-reasoned and the new unit tests cover the key behaviors.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
What does this PR do?
Type of change: New recipe + Bug Fixes
`hf_quant_config.json` export bug fixes
MCore and MSE fixes
- Use the `_global_amax` field in `NVFP4QTensor` static quantizer detection; fixes a bug during MCore export for MSE.
- Restore static quantizers as `NVFP4QTensor` (not `TensorQuantizer`, which can call max calibrate; we want to skip max calibrate for static quantizers during restore).
- Handle the case where `block_sizes` is dict-backed.
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:
rest: not quantized
Usage
# Add a code snippet demonstrating how to use this

Testing
TODO test in HF and MCore PTQ on Nemotron model
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A
Summary by CodeRabbit
New Features
Improvements
Tests